mlm model
Enabling Autoregressive Models to Fill In Masked Tokens
Israel, Daniel, Grover, Aditya, Broeck, Guy Van den
Historically, LLMs have been trained using either autoregressive (AR) or masked language modeling (MLM) objectives, with AR models gaining dominance in recent years. However, AR models are inherently incapable of masked infilling, which is the ability to predict masked tokens between past and future context. In contrast, MLM models suffer from intrinsic computational inefficiencies during both training and inference that hinder their scalability. This work introduces MARIA (Masked and Autoregressive Infilling Architecture), a novel approach that leverages the strengths of both paradigms to achieve state-of-the-art masked infilling performance. MARIA combines a pre-trained MLM and AR model by training a linear decoder that takes their concatenated hidden states as input. This minimal modification enables the AR model to perform infilling while retaining its inherent advantages in terms of faster inference with KV caching. Our results demonstrate that MARIA significantly outperforms existing methods, namely discrete diffusion models, on masked infilling tasks.
- Asia > Middle East > Israel (0.04)
- North America > United States > California > Los Angeles County > Los Angeles (0.04)
- Asia > Middle East > Jordan (0.04)
Review for NeurIPS paper: Pre-training via Paraphrasing
Weaknesses: The overall idea of retrieving related texts for pre-training is similar to REALM. The adopted retrieval method needs more refinement in detail. It is roughly based on overall document cosine similarity, which may involve much noise. Besides, the retrieval task is latently problematic, as it is closely related to the training target of the model and is not capable of reflecting the effectiveness of the encoder. The machine translation also does not measure the performance of the encoder because it relies on the decoder to generate the correct target sequence. The experiments and experimental settings are insufficient.
Generating Synthetic Free-text Medical Records with Low Re-identification Risk using Masked Language Modeling
Belkadi, Samuel, Ren, Libo, Micheletti, Nicolo, Han, Lifeng, Nenadic, Goran
In this paper, we present a system that generates synthetic free-text medical records, such as discharge summaries, admission notes and doctor correspondences, using Masked Language Modeling (MLM). Our system is designed to preserve the critical information of the records while introducing significant diversity and minimizing re-identification risk. The system incorporates a de-identification component that uses Philter to mask Protected Health Information (PHI), followed by a Medical Entity Recognition (NER) model to retain key medical information. We explore various masking ratios and mask-filling techniques to balance the trade-off between diversity and fidelity in the synthetic outputs without affecting overall readability. Our results demonstrate that the system can produce high-quality synthetic data with significant diversity while achieving a HIPAA-compliant PHI recall rate of 0.96 and a low re-identification risk of 0.035. Furthermore, downstream evaluations using a NER task reveal that the synthetic data can be effectively used to train models with performance comparable to those trained on real data. The flexibility of the system allows it to be adapted for specific use cases, making it a valuable tool for privacy-preserving data generation in medical research and healthcare applications.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- Europe > United Kingdom > England > Greater Manchester > Manchester (0.04)
- (2 more...)
- Research Report > New Finding (0.68)
- Research Report > Experimental Study (0.46)
Exploration of Masked and Causal Language Modelling for Text Generation
Micheletti, Nicolo, Belkadi, Samuel, Han, Lifeng, Nenadic, Goran
Large Language Models (LLMs) have revolutionised the field of Natural Language Processing (NLP) and have achieved state-of-the-art performance in practically every task in this field. However, the prevalent approach used in text generation, Causal Language Modelling (CLM), which generates text sequentially from left to right, inherently limits the freedom of the model, which does not decide when and where each token is generated. In contrast, Masked Language Modelling (MLM), primarily used for language understanding tasks, can generate tokens anywhere in the text and any order. This paper conducts an extensive comparison of MLM and CLM approaches for text generation tasks. To do so, we pre-train several language models of comparable sizes on three different datasets, namely 1) medical discharge summaries, 2) movie plot synopses, and 3) authorship verification datasets. To assess the quality of the generations, we first employ quantitative metrics and then perform a qualitative human evaluation to analyse coherence and grammatical correctness. In addition, we evaluate the usefulness of the generated texts by using them in three different downstream tasks: 1) Entity Recognition, 2) Text Classification, and 3) Authorship Verification. The results show that MLM consistently outperforms CLM in text generation across all datasets, with higher quantitative scores and better coherence in the generated text. The study also finds \textit{no strong correlation} between the quality of the generated text and the performance of the models in the downstream tasks. With this study, we show that MLM for text generation has great potential for future research and provides direction for future studies in this area.
- Oceania > Australia (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Europe > United Kingdom > England > Greater Manchester > Manchester (0.04)
- (4 more...)
- Media > Film (0.46)
- Health & Medicine > Therapeutic Area (0.46)
- Law > Criminal Law (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Federated Learning of Medical Concepts Embedding using BEHRT
Shoham, Ofir Ben, Rappoport, Nadav
Electronic Health Records (EHR) data contains medical records such as diagnoses, medications, procedures, and treatments of patients. This data is often considered sensitive medical information. Therefore, the EHR data from the medical centers often cannot be shared, making it difficult to create prediction models using multi-center EHR data, which is essential for such models' robustness and generalizability. Federated Learning (FL) is an algorithmic approach that allows learning a shared model using data in multiple locations without the need to store all data in a central place. An example of a prediction model's task is to predict future diseases. More specifically, the model needs to predict patient's next visit diagnoses, based on current and previous clinical data. Such a prediction model can support care providers in making clinical decisions and even provide preventive treatment. We propose a federated learning approach for learning medical concepts embedding. This pre-trained model can be used for fine-tuning for specific downstream tasks. Our approach is based on an embedding model like BEHRT, a deep neural sequence transduction model for EHR. We train using federated learning, both the Masked Language Modeling (MLM) and the next visit downstream model. We demonstrate our approach on the MIMIC-IV dataset. We compare the performance of a model trained with FL against a model trained on centralized data. We find that our federated learning approach reaches very close to the performance of a centralized model, and it outperforms local models in terms of average precision. We also show that pre-trained MLM improves the model's average precision performance in the next visit prediction task, compared to an MLM model without pre-training. Our code is available at https://github.com/nadavlab/FederatedBEHRT.
- North America > United States > Virginia (0.04)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- North America > United States > Indiana (0.04)
- (2 more...)
- Research Report > New Finding (0.94)
- Research Report > Experimental Study (0.68)
Short Text Pre-training with Extended Token Classification for E-commerce Query Understanding
Jiang, Haoming, Cao, Tianyu, Li, Zheng, Luo, Chen, Tang, Xianfeng, Yin, Qingyu, Zhang, Danqing, Goutam, Rahul, Yin, Bing
E-commerce query understanding is the process of inferring the shopping intent of customers by extracting semantic meaning from their search queries. The recent progress of pre-trained masked language models (MLM) in natural language processing is extremely attractive for developing effective query understanding models. Specifically, MLM learns contextual text embedding via recovering the masked tokens in the sentences. Such a pre-training process relies on the sufficient contextual information. It is, however, less effective for search queries, which are usually short text. When applying masking to short search queries, most contextual information is lost and the intent of the search queries may be changed. To mitigate the above issues for MLM pre-training on search queries, we propose a novel pre-training task specifically designed for short text, called Extended Token Classification (ETC). Instead of masking the input text, our approach extends the input by inserting tokens via a generator network, and trains a discriminator to identify which tokens are inserted in the extended input. We conduct experiments in an E-commerce store to demonstrate the effectiveness of ETC.
- Information Technology > Information Management > Search (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.94)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.70)
Adversarial Contrastive Pre-training for Protein Sequences
McDermott, Matthew B. A., Yap, Brendan, Hsu, Harry, Jin, Di, Szolovits, Peter
Recent developments in Natural Language Processing (NLP) demonstrate that large-scale, self-supervised pre-training can be extremely beneficial for downstream tasks. These ideas have been adapted to other domains, including the analysis of the amino acid sequences of proteins. However, to date most attempts on protein sequences rely on direct masked language model style pre-training. In this work, we design a new, adversarial pre-training method for proteins, extending and specializing similar advances in NLP. We show compelling results in comparison to traditional MLM pre-training, though further development is needed to ensure the gains are worth the significant computational cost.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.06)
- Oceania > Australia > Victoria > Melbourne (0.04)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.04)